Data Science Pipeline Tutorial: College Data

In this tutorial, we will walk through the data science pipeline: collecting data, cleaning and parsing it, performing exploratory data analysis, testing hypotheses, and applying machine learning.

This tutorial is written in Python, using a custom dataset adapted from the U.S. Department of Education's College Scorecard. To download the full dataset and documentation, visit https://collegescorecard.ed.gov/data/. We will focus on data from the Most Recent Data by Field of Study table in order to learn about different universities' revenue, tuition costs, racial diversity, and more through the data science pipeline.

Setup

First, we import the libraries we want to use. We will be using a pandas dataframe to represent and manipulate the data, so we import the pandas library. Matplotlib and seaborn are both useful for plotting data visually, and folium is great for mapping. NumPy and scikit-learn (sklearn) will help us work with the numeric data and run linear regressions.

Loading and Viewing Data

Next, we load the data by passing the URL where it is hosted to pandas's read_csv() function. If your data is stored locally, read_csv() also accepts the path to the file where your data is stored.

We've uploaded the data that we'll be using as a csv file on GitHub, so we'll import from that URL.
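A minimal sketch of the loading step is below. Since the actual GitHub URL isn't reproduced here, a small inline sample (with hypothetical column names and values) stands in for it; read_csv() works identically whether you hand it a URL, a file path, or a file-like object.

```python
import io
import pandas as pd

# Stand-in for the hosted csv; in the tutorial you would instead call
# pd.read_csv("<github raw csv url>") or pd.read_csv("path/to/local/file.csv")
csv_text = """INSTNM,LATITUDE,LONGITUDE,ADM_RATE
Example University,38.99,-76.94,0.51
Sample College,40.71,-74.01,0.36
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```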

This data is now loaded into our dataframe, df. A pandas dataframe is a 2-dimensional structure that stores data, like a spreadsheet. With dataframes, pandas gives us many ways to interact with our data. This dataframe has each row representing a university, and each column representing one attribute of the university, such as its name, latitude and longitude, website url, and more.

However, it turns out we can't view all of the data. Let's change up some settings in pandas to view all the columns.
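The relevant pandas display options can be adjusted like this:

```python
import pandas as pd

# By default pandas truncates wide dataframes; show every column instead
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
```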

Missing Data

In this data, you can see that some values in some columns are listed as "NaN", meaning "Not a Number". This indicates that data in that column is missing for that university. There are many ways to deal with missing data, such as filling in reasonable values or simply removing the associated row or column. To best determine what to do about missing data, let us first evaluate which columns have the most of it.
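Counting missing values per column is a one-liner in pandas. The sketch below uses a tiny stand-in dataframe (the column names and values are hypothetical) so it runs on its own; on the real df you would just run the last two lines.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real dataframe
df = pd.DataFrame({
    "INSTNM": ["A Univ", "B College", "C Institute"],
    "SAT_AVG": [np.nan, np.nan, 1200.0],
    "UGDS_WHITENH": [np.nan, np.nan, np.nan],
})

# Count NaNs in each column, most-missing first
missing = df.isna().sum().sort_values(ascending=False)
print(missing)
```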

The columns missing the most data are the number of Black, non-Hispanic undergraduates and the number of White, non-Hispanic undergraduates, as well as statistics about mean SAT and ACT section scores.

This could be for various reasons. For example, the dataframe already has columns for Black students, White students, and Hispanic students, so the overlapping Black-and-Hispanic or White-and-Hispanic students were likely kept in those separate counts rather than being broken out again as Black non-Hispanic and White non-Hispanic. Since we have no good way to estimate how many of the Black and White students are not Hispanic, we will opt to simply drop these columns.

The SAT and ACT section scores may be missing due to factors like test-optional admissions policies, scores not being broken out by section once collected, or other reasons. We will opt to remove these columns too.

The last column we want to delete is 'Unnamed: 0', which is an indexing of the rows. There's no need for that since the dataframe tracks its own index, so we'll drop that column as well.

Dropping columns is quite simple when you know the names of the target columns.
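A sketch of the drop, again on a small stand-in dataframe with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "INSTNM": ["A Univ", "B College"],
    "SAT_AVG": [1100, 1200],
})

# Drop unwanted columns by name; on the real dataframe the list would also
# include the non-Hispanic racial-count columns and the ACT columns
df = df.drop(columns=["Unnamed: 0", "SAT_AVG"])
print(df.columns.tolist())
```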

Exploratory Data Analysis

There are about 3400 rows in our dataframe currently, which means there are a lot of universities to look at. Here we can see them all mapped using the folium library, with a dot marker for each university.

Since we want to explore the relationships between different attributes of these universities, we have to first do some additional analysis. One thing we can do is create a correlation matrix that shows whether two variables are closely related, apparently unrelated, or even negatively correlated. In the heatmap, darker colors show a stronger positive correlation, and lighter colors show a weaker or negative correlation.
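A correlation heatmap can be built with pandas and seaborn like so. The synthetic columns below (tuition, revenue per student, admission rate) only mimic the real data so the sketch is self-contained; on the real df you would just call the last two lines.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: revenue is constructed to track tuition
rng = np.random.default_rng(0)
tuition = rng.uniform(5000, 50000, 100)
df = pd.DataFrame({
    "tuition": tuition,
    "revenue_per_student": tuition * 1.2 + rng.normal(0, 3000, 100),
    "admission_rate": rng.uniform(0.05, 1.0, 100),
})

# Pairwise correlations of the numeric columns, drawn as a heatmap
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.tight_layout()
```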

Some observations we can make from this correlation matrix are that tuition rates are somewhat correlated with revenue per student, and somewhat negatively correlated with admission rate. One interesting observation with some relevance to current events is the correlation between admission rate and certain racial breakdowns, as well as between racial breakdowns and longitude and latitude.

In order to further explore geographical trends in racial demographics, we can plot the percentage of students of each race on a scatterplot by location. Grouping racial demographics by latitude and longitude is less meaningful in the US, where multiple states with a variety of socioeconomic situations exist at similar latitudes or longitudes. Instead, we can look at the demographic breakdown of universities by state and by region.

The US can be divided into 5 regions: West, Midwest, NorthEast, SouthEast, and SouthWest.
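One way to attach a region to each university is a state-to-region lookup table, which can then be applied to the dataframe's state column. The exact assignment of border states varies between sources; the grouping below is one reasonable choice, not the tutorial's definitive one.

```python
# Five-region grouping of US states (plus DC); border-state assignments vary
regions = {
    "West": ["WA", "OR", "CA", "NV", "ID", "MT", "WY", "UT", "CO", "AK", "HI"],
    "SouthWest": ["AZ", "NM", "TX", "OK"],
    "Midwest": ["ND", "SD", "NE", "KS", "MN", "IA", "MO",
                "WI", "IL", "MI", "IN", "OH"],
    "SouthEast": ["AR", "LA", "MS", "AL", "TN", "KY", "WV",
                  "VA", "NC", "SC", "GA", "FL"],
    "NorthEast": ["PA", "NY", "NJ", "CT", "RI", "MA", "VT",
                  "NH", "ME", "MD", "DE", "DC"],
}

# Invert to a flat state -> region lookup
state_to_region = {s: r for r, states in regions.items() for s in states}

# On the real dataframe: df["REGION"] = df["STABBR"].map(state_to_region)
# (STABBR is a hypothetical name for the state-abbreviation column)
```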

Here we can see that the population breakdown varies by region of the United States. We hypothesize in this tutorial that admission rates also differ for different races in the United States.

Hypothesis Testing

Let's take a look at race vs. acceptance rate. Hypothesis testing is an important step in analyzing data, as it gives a quantitative method to gauge whether two variables are actually related. For this step, we used a machine learning technique: fitting a linear regression to the data. Linear regression is an algorithm used when the range of values is continuous, as cost of attendance is. This method attempts to predict the value of one variable from the value of another by fitting a straight line.

For each race, we will fit a linear regression to that group's share of the student population per university as the cost of attendance increases.
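The per-race fit can be sketched with scikit-learn as follows. The data here is synthetic (a hypothetical population share that declines slightly with cost), standing in for one race's column in the real dataframe; the fitting code itself is what the tutorial's loop would run once per race.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data for one racial group
rng = np.random.default_rng(1)
cost = rng.uniform(10, 60, 200)  # cost of attendance, in $1000s (hypothetical)
share = np.clip(0.4 - 0.003 * cost + rng.normal(0, 0.05, 200), 0.0, 1.0)

# scikit-learn expects a 2-D feature matrix, hence the reshape
X = cost.reshape(-1, 1)
model = LinearRegression().fit(X, share)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
```

The sign and magnitude of `model.coef_[0]` is the per-race coefficient discussed below; plotting `model.predict(X)` over a scatter of the raw points gives the regression-line graphs.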

Among the linear regressions, Hispanic individuals had the steepest coefficient at -.027, Black individuals were second at -.019, and the rest had values closer and closer to zero. Looking at the graphs, we can see that the White and Black populations are both fairly tightly centered around the regression line, but the Black population deviates from the line more than the White population does further left, at the lower costs. This is what gives it a smaller coefficient in its regression line. The Asian population, on the other hand, deviates less on the left-hand side, balanced out by a deviation on the right side of the graph as well; the data here is well spread out, leading to a coefficient of just about 0. Hispanic and AIAN individuals deviated from the regression line at lower values, while Native Hawaiian and Pacific Islander individuals stayed close to the regression line, whose slope is near 0.

Conclusion

From these linear regressions, we can see that the correlations between race and acceptance rate may have been overstated in the earlier correlation matrix. The coefficient of the linear regression was very small for most races, and roughly 0 for some. This does not suggest a strong correlation between these variables. In the future, it may be worth exploring the relationship between the racial breakdown of US regions and these acceptance rates, since limiting to an area may yield more specific results.